Modeling and verifying a broad array of network properties 



Vladimir Filkov,^'!^] Zachary M. Saul,^ Soumen Roy,^'^ Raissa M. D'Souza,^'^ and Premkumar T. Devanbu^ 

^Department of Computer Science, University of California, Davis, CA 95616 
^Center for Computational Science and Engineering, University of California, Davis, CA 95616 
^Department of Mechanical and Aeronautical Engineering, University of California, Davis, CA 95616 

Motivated by widely observed examples in nature, society and software, where groups of related 
nodes arrive together and attach to existing networks, we consider network growth via sequential at- 
tachment of linked node groups, or graphlets. We analyze the simplest case, attachment of the three 
node \/-graphlet, where, with probability a, we attach a peripheral node of the graphlet, and with 
probability (1 — a), we attach the central node. Our analytical results and simulations show that 
tuning a produces a wide range in degree distribution and degree assortativity, achieving assortativ- 
ity values that capture a diverse set of many real- world systems. We introduce a fifteen-dimensional 
attribute vector derived from seven well-known network properties, which enables comprehensive 
comparison between any two networks. Principal Component Analysis of this attribute vector space 
shows a significantly larger coverage potential of real- world network properties by a simple extension 
of the above model when compared against a classic model of network growth. 

PACS numbers: 89.75.Hc,89.75.Fb 



The ubiquity and importance of network structures has 
recently become apparent, leading to an increased focus 
on network growth mechanisms [1 . Existing models of 
network growth primarily consider the arrival of single 
nodes at each time step; however, there are numerous ex- 
amples in natural and artificial systems where networks 
grow not just by the addition of single nodes but by the 
addition of groups of already related nodes. For example, 
in biology, in developmental transcriptional gene regula- 
tion, whole pathways can be added or eliminated by a 
mutation in a master regulator [3 ; and in the evolution 
of biological networks, gene duplication can add subnet- 
works to the network [4 . Growth of computer software 
networks (composed of interacting functions or classes) is 
often due to adding small groups of related elements si- 
multaneously. For example, 1) functions to allocate, use, 
and free a resource (such as a file) are usually added to- 
gether and 2) in object-oriented languages, good design 
principles call for classes to be added in small groups 
called design patterns [5 . Further, in social networks 
within cities, families arrive as units, and growth can be 
described via aggregation of small pre-existing modules. 
Similarly, in corporate enterprises, the practice of "lift- 
outs" , employing pre-existing functional teams of people 
(rather than building up a team from individual hires) , is 
on the rise [6 . This insight suggests that a new class of 
network growth models incorporating group arrival could 
lead to more realistic models. Moreover, most existing 
work in modeling network growth focuses on matching a 
single or few attributes of empirical networks, in particu- 
lar degree distribution, clustering coefficient, etc. But 
networks can differ in many ways while being similar 
in others, e.g. some with the same degree distribution 
have different levels of assort at ive mixing. Thus, a more 
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comprehensive comparison, simultaneously across many 
important attributes, is desirable. 

Hence, the purpose of this letter is two-fold. Our first 
intent is to propose modeling network growth by sequen- 
tial aggregation of groups of nodes, represented by small, 
connected graphs or graphlets attaching preferentially in 
the network, rather than by preferential attachment of 
single nodes. Thus, we introduce the graphlet arrival 
model and show that in spite of its added complexity im- 
portant analytical results can be obtained. The model 
based on iteratively adding the three-node \/"S^^phl^t 
yields networks with degree distributions (the distribu- 
tion of the probability of observing a node of degree k) 
that follow an asymptotic power law, i.e., pk ^ ^ 
where, 7 is a parameter ranging from 3 < 7 < 5, in 
agreement with those found in a number of highly-cited 
studies of real- world systems where graphlets could play 
a crucial role We also analytically derive the degree 
assortativity, p, a measure of the tendency of nodes to 
link to nodes of like degree, which has the power to dis- 
criminate between empirical networks from various fields, 
even if they have similar degree distributions [8[ [9] . As 
noted recently [8 , an interesting open problem is to come 
up with a single growth model which could generate net- 
works of both positive assortativity, like social networks, 
and negative assortativity, like technological and biolog- 
ical networks. We find that our model yields tunable 
assortativity, with respect to a parameter, a, which de- 
termines the graphlet attachment point probability, as 
explained below. Our numerical results for networks up 
to :^ 10^ nodes in size (which covers most real- world net- 
works) show assortative behavior, {p > 0), for lower a 
and dissortative behavior, (p < 0), for higher values of a. 
Our analytical calculations show that p > for infinite 
size networks. 

The second intent of this letter is to introduce tech- 
niques for comprehensively comparing networks across a 
suite of network properties simultaneously, allowing for a 
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FIG. 1: The Growth Process. The \/-graphlet arrives and 
probabihty 1 — a) or its periphery (with probabiUty a). Here 
(left) and a = 1 (right). Already the creation/surpression of 
degree distribution for a = 1. Networks grown with < a < 1 

much more in-depth evaluation of network models than 
is possible using the commonly existing practice of com- 
paring primarily degree distribution. To that end, we 
compare the ability of our model networks to match the 
variability of 113 real networks under 15 attributes, and 
demonstrate how data mining methods like clustering 
and statistical dimension reduction (via Principal Com- 
ponent Analysis) can be utilized to assess that match. 
A simple extension of our model yields remarkably large 
coverage of the attribute space spanned by the 113 real 
networks, and a significant match of the ranges of real 
networks over all attributes. 

To fully model with the graphlet arrival paradigm, one 
must decide on which graphlet (s) to use, with which of 
their nodes to attach, and where in the network to attach 
them. Common undirected graphlets include the dyad 
(edge), the two triads (3 nodes) and the six tetrads (4 
nodes) . To properly analyze their arrival and attachment 
into the network one must classify the graphlets' nodes 
into equivalence classes based on symmetry. Our model, 
illustrated in Fig. [T] considers the simplest non-trivial 
case: series of arriving triads consisting of a single node of 
degree two and two identical nodes of degree one, which 
we call the \/-graphlet. This graphlet 's asymmetry pro- 
vides a choice of two topologically different attachment 
points (the two nodes of degree one are equivalent but 
different than the single node of degree two), unlike the 
edge and triangle graphlets which allow only one. The 
graphlets attach to the network by merging one of their 
vertices into an existing node selected with probability 
proportional to the node's degree, i.e., via preferential 
attachment. The model chooses the degree-one merge 
point with probability a and the degree-two merge point 
with probability (1 — a). 

First, we derive the asymptotic degree distribution, p/^, 
for the \/-gTdiphlet arrival model via a master equation 
approach. Starting with a single edge at time t = 0, the 
number of nodes at time t is N{t) = 2t + 2 2t, for 
large t. Let di{t) denote the degree of vertex i at time t. 




merges into the existing network at either its midpoint (with 
we show the process, after the arrival of 10 graphlets for a = 
hubs is evident as well as the more homogeneous nature of the 
show behaviors intermediate between these two. 



Then, the probability that incoming graphlet j merges 
with node i is pj^i = = ^ = where 

^idi{t) = 2N(t) as there is one edge for each node in 
the graph. Let Nk{t) be the number of nodes with degree 
k at time t. Due to the asymmetry of the we get 
separate equations of Nkit) for /c > 3, /c = 2 and k = 1. 
Making the natural assumption that Pk(f) = Nf^{t)/N{t) 
and assuming steady-state {pk{t) Pk) leads to Nk{t) = 
2tpk' From this and the Nk{t) equations, which may be 
detailed elsewhere [1^, we get : 

Pk>3 = a[{k-l)/{k^4)]pk-iH^-a)[{k-2)/{k^4)]pk-2. 

with p2 = ^{7 - a) and pi = |(2 - a). Since pk>3 
depends on both Pk-i and Pk-2 non-trivially, we cannot 
solve it analytically. However for large /c, a simple linear 
approximation results in 7 = (6 — a) /(2 — a). The results 
of numerical solutions are shown in the inset to Fig. [2] 

The degree assort at ivity, p, is defined as the Pearson 
correlation coefficient between the degrees of all pairs of 
connected vertices in the network [8]. Here, using a rate 
equation approach [TTl [12] , we directly calculate p from 
Cki^ the probability distribution that an edge in an undi- 
rected graph is incident to vertices of degree k and and 
Pk, the degree distribution [10 . Let Eki{t) denote the 
number of edges with a vertex of degree k at one end 
and a vertex of degree / at the other at time t. We note 
that "^kyi ^ki{t) = 2t-\-2 ^ 2t^ for large which implies 
Eki = 2teku foi" steady state. To derive a rate equation 
for Eki{t) we account for the processes that change it 
when a new V arrives. The processes that increase Eki 
are when: with probability (1 — a), a \/ merges its mid- 
point to a vertex of degree /c — 2, which is already attached 
to a vertex of degree / (and the same argument with k 
and / reversed); with probability a, a V merges one of its 
endpoints to a vertex of degree k — 1 (respectively / — 1), 
which is already attached to a vertex of degree / (respec- 
tively /c); in the special case when /c = 1, with probability 
(1 — a), a V merges its midpoint to a vertex of degree 
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FIG. 2: (Color online) Points show the mean p over 100 simu- 
lations of 10^ node networks. Bars represent values within two 
standard deviations. Solid line is the theoretical prediction for 
networks with maximum degree 2500. Inset: Distribution of 
average degree for a — 0.2 and a = 0.7 over an ensemble of 
5000 realizations, together with best fitting lines with slopes 
equal to the, respective, analytical 7's of 3.2 and 4.1. 

/ — 2, producing two new edges, each incident to vertices 
of degree / and 1; in the special case when /c = 2, with 
probability a, a \/ merges one of its endpoints to a vertex 
of degree / — 1, producing one new edge incident to ver- 
tices of degree / and 2. The processes that decrease E^i 
are when: with probability (1 — a) a new V merges its 
midpoint to a vertex of degree k (respectively which 
is already attached to a vertex of degree / (respectively 
k)] with probability a a new V merges one of its end- 
points to a vertex of degree k (respectively /), which is 
already attached to a vertex of degree / (respectively k). 
From these cases, and incorporating preferential attach- 
ment (by multiplying the number of edges gained or lost 
by m/4t, where m is the degree of the node to which the 
new V is attached), we derive a rate equation for Eki'. 

(1 - a)[Ek_2,i{t){k - 2) + Ek,i-2{t){l - 2) + 2Ni_2{l - 2)(5fc,i] + 
a[Ek-i,i{t){k - 1) + Ek^i_^{t){l - 1) + Ni_^{l - l)Sk,2] - 
Eki(t)(k + l), 

where dij is the Kronecker delta. Substituting Eki{t) = 
2teki and N}.(t) = 2tpk eliminates time from this equa- 
tion and yields expressions for e/c>3^^, ei^^>3 and e2,z>3. 
To initialize the recurrences we similarly calculate en = 
0, 621 = ei2 = 2a/7, and 622 = ci^(ei2+Pi)/8. In addition, 
because of the symmetry of the Eij terms, and since the 
edges are undirected when i = we are over-counting so 
we divide e^j by 2. Conversely, when < |i — j| < 2, we 
are under-counting and so we multiply Cij by 2. There- 
fore, the e/e^'s (and hence, p) can be calculated for 
a given value of a. A plot of p versus a can be seen 
in Fig. [2j where a good agreement is apparent with net- 
works simulated from our model. Previous attempts to 
create a model that admitted varying p values worked by 
rewiring the edges of an existing network [i3\. In con- 




FIG. 3: (Color online) Illustration of the \/^ model. Once the 
graphlet attaches to the network, based on a, we introduce 
up to / (here / = 4) additional edges from the graphlet into 
random nodes of the network, each with probability p. 

trast, our model grows networks with a range of negative 
and positive p values from first principles, giving insight 
into how assort at ivity may arise in networks. We note 
that in our experiments p approaches from the neg- 
ative side for a < 2/3, but it does so very slowly and 
is negative for all networks we tried (up to 10^ nodes). 
It can be shown that [T0| in the thermodynamic limit 
Newman's original formula for p [8 yields p = when 
a < 2/3. 

The \/"g^^phlet arrival model always produces trees 
and hence is not expected to match empirical networks 
on some interesting properties (such as clustering coef- 
ficient). Therefore, we examine a simple extension to 
the V-^odel which allows it to produce denser graphs, 
without significantly affecting the model's degree distri- 
bution and assort at ivity features. The extended model, 
illustrated in Fig. [3| adds with probability p at each 
time step, / edges (or dyads) from the arriving graphlet 
into the existing network, with the attachment points 
being chosen uniformly at random. In addition to al- 
lowing denser graphs, this ^^\/ p-modeV^ also reflects the 
behavior in various real- world networks, where a newly 
arriving graphlet may attach to the existing network at 
more than one point (e.^., new families arriving in a city, 
etc.). A theoretical analysis for the extended model is 
very complex. Instead, in the following model compar- 
ison we simulate networks for many values across the 
possible parameter space (a,/3, /). 

Existing network literature compares networks or net- 
work models by studying one or two particular proper- 
ties, and most commonly the degree distribution. In this 
letter, we introduce a fifteen-dimensional attribute vec- 
tor of seven well-known network properties, which should 
enable a general and comprehensive comparison between 
any set of networks. These properties are: the number of 
nodes, the number of edges, the geodesic distribution, 
the betweenness coefficient distribution, the clustering 
coefficient distribution, the assort at ivity, and the degree 
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FIG. 4: (Color online) Symmetric heatmap of attribute cor- 
relations among networks. Red (blue) indicates perfect cor- 
relation (anti-correlation). White is the intermediate case of 
no correlation. The small amount of clustering along the di- 
agonal attests to the relative independence of the attributes. 



distribution of the network. For the four distributions, 
we use the mean, standard deviation, and skewness as 
proxy attributes, for a total of 15 attributes. Networks 
are mapped to points in a 15-dimensional space defined 
by these attributes, normalizing each value by subtract- 
ing the attribute mean and dividing by the attribute's 
standard deviation. 

Our collection of real- world networks consists of 113 
diverse networks from biological, social and technical do- 
mains. It includes software call graphs [14 , a social 
network of software developers [15^, political social net- 
works [16l |T7], 3 gene networks [181 ttSl ISO], 3 protein- 
protein interaction networks (21], cellular networks for 
several organisms [22 , and several others downloaded 
from a web repository of networks [23l [24 l [25 l [26 l [27 | 
[28l \29\ [30] , The degree of overlap, or dependence, be- 
tween the attributes when characterizing networks can 
be assessed by the symmetric heatmap in Fig. |4j show- 
ing the pairwise correlations (Pearson) of the network 
attributes over a representative sample of real- world net- 
works (one from each data set described above). The 
rows and columns of the heatmap are ordered so that, 
within the limitations of the hierarchical clustering used, 
the attributes most correlated with each other are placed 
closest. The map allows us to identify clusters of "simi- 
lar" network attributes by looking for blocks of squares 
along the diagonal of the figure. Since there is only a 
small amount of clustering along the diagonal, it follows 
that most network attributes we have chosen are rela- 
tively independent, and thus, provide information to our 
analysis. 

In the following analysis we have eliminated 4 of the 15 
network attributes and retained 11. One reason, is that 
some attributes, like number of nodes and edges were 
tightly correlated as indicated in the heatmap. So we 
only kept one of them, the number of edges. Another 



reason is that since the Vth moment of a power-law dis- 
tribution, p{k) ~ k~^, is only defined for / < (7 — 1), 
we have omitted the skew of the degree distribution, as a 
precaution. For the same reason, the variance and skew 
of the betweenness distribution have been left out, even 
though the exact nature of the betweenness distribution 
does not seem to be known. The distribution of the clus- 
tering coefficient and geodesic are defined for the models 
investigated in this paper [34 and have hence been re- 
tained. 

Next, we compare a collection of \/ ^-diTiiYal growth 
networks to the above collection and to a baseline collec- 
tion of networks from the well-known B A model [ST . We 
chose BA as a baseline because, like BA, our graphlet- 
arrival model uses the mechanism of preferential attach- 
ment, only instead of nodes we have graphlets arriving. 
We sample a large swath of the parameter space for the 
V^-arrival model, iterating across several possible val- 
ues for each parameter and creating networks that cover 
the size range of real-world networks. To this end, we 
use network sizes ranging from 500 to 5250 nodes at 250 
node intervals, a values in the range < a < 1 at inter- 
vals of 0.1, P values in the range < < 1 at intervals 
of 0.1, and / values in the range 1 to 5. For each possible 
combination of values of these four parameters, we create 
five networks, giving us a total of 60, 500 networks. For 
the BA model, we generate 500 sample networks by vary- 
ing the number of nodes in the same range as our model 
(with identical increments), varying the number of edges 
added at each attachment from 1 to 5, and creating 5 
sample networks for each possible combination of these 
two parameters. 

To objectively assess the extent to which our model 
networks cover the range of attributes simultaneously, 
we visualize the attribute space using an established sta- 
tistical dimension-reduction technique. Principal Compo- 
nents Analysis (PGA), which guarantees maximal reten- 
tion of the variance when projecting data into a lower 
dimension PGA finds the projection of an n- 

dimensional data set onto a space of the same dimension, 
where the new axes, or principal components^ are orthog- 
onal and linear combinations of the original dimensional 
variables, such that the first d axes, d < retain the 
maximal variance of the original data set possible with 
that many dimensions. Fig. [5] shows the projections of 
the sets of \//3 model, BA model, and real-world net- 
works onto the first three principal components (out of 
11) of the real- world data set found by the PGA algo- 
rithm. These principal components retain 71% of the 
original data variance and demonstrate the larger cover- 
age potential of the extended graphlet arrival model. We 
note that these results are fairly stable with respect to 
the number of variables used in the PGA analysis: using 
between 2 — 4 fewer (or more) than the 11 variables does 
not qualitatively change the results [10]. While PGA has 
been used before to cluster networks [33], our methodol- 
ogy here is novel in that it offers a general and explicit 
way to compare growth models relative to each other. 
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with respect to the fraction of PCA space they cover. 
Additionahy, it ahows for models to be compared more 
finely, along individual or combinations of original vari- 
ables, by projecting those variable vectors in the same 
PCA space, e.g. assortativity in Fig. [5) and then observ- 
ing the spread differentials between the model networks 
along those vectors. 

In conclusion, graphlet arrival models are a positive 
step toward more realistic network models which, as we 
show, better approximate empirical networks in biology, 
software, and social science, both in the modeling step 
(graphlet versus node arrival) and in the results (match- 
ing more complex measures of networks, like assortativ- 
ity). A broad degree distribution and wide variation of 
assortativity are features of the \/-arrival model which 
are not present in preferential attachment models that 
grow via individual nodes, and/or edges. In particu- 
lar, we believe that the attachment asymmetry of the 



V-graphlet is largely responsible for these features and 
that they would not be apparent in a graphlet model of 
fully connected graphlets (e.g edge, triangle, or square). 
Therefore, we expect more complete graphlet arrival 
models (whose theoretical analysis would also be more 
complex), considering a larger set of possible graphlets 
to yield even better models of empirical networks (we 
also note that the addition of simpler graphlets should 
expand the range of possible 7's to below 3, where the 
exponents of most real-world networks with power-law 
degree-distributions reside). Finally, we anticipate that 
the technique of comprehensive comparison of networks 
across a suite of network properties introduced in this 
letter, would find wider use in the network literature. 
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improved this paper. 
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FIG. 5: (Color online) A projection of our model nets (orange points), the BA model nets (light blue), and real- world nets 
(black) onto the first three principal components of the eleven dimensional PCA space of our real-world data set (we omitted 
the number of nodes, skew of the degree distribution, and variance and skew of the betweenness distribution from the original 
15 attributes, as described in the text). Here, the PCAl axis is primarily composed of (in terms of their coefficient's magnitude) 
a combination of the number of edges, mean and skew of geodesic, mean and st. dev. of clustering, and mean and st. dev. of 
degree. PCA2 is mainly a combination of the st. dev and mean of geodesic, and assortativity. PC A3 is mainly a combination 
of the mean of betweenness, mean of clustering, number of edges, and st. dev of degree. As an example of a spread along an 
original parameter, the grey arrow is parallel to and shows the direction and magnitude of assortativity when projected onto 
this space. 



